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1. Introduction and previous work 

What is matrix deflation? A good, working definition is that it consists of the effective removal 
or projection of the low lying eigenmodes of a matrix in order to speed up additional calculations 
with the same matrix. It is known, for Hermitian systems, that the convergence rate for numerically 
solving linear equations is proportional to the condition number, which is defined as the square root 
of the ratio of the largest to the smallest eigenvalue, s/XnJXx. Reducing this ratio will effectively 
speed up the solution of the associated linear equations. For lattice QCD these small eigenvalues 
occur at small quark masses, and the general numerical problem being encountered is that of critical 
slow-down. 

The first discussion I am aware of where deflation is used in a lattice QCD calculation was in 
the plenary talk of Ref. [1], which reported some computer experiments for system of equations for 
multiple right-hand sides. Unfortunately, the results did not show great promise then because the 
eigenvalues of the systems studied were not sufficiently small. The application and use of a separate 
deflation step in the acceleration of the inner loop of overlap fermions, which are Hermitian shifted 
systems, is documented in Refs. [2] and [3]. A number of authors, including those of Ref. [4], 
have used deflation techniques in the context of calculating flavor-singlet propagators using the 
Hermitian Wilson-Dirac operator. See also Ref. [5] for a similar method using staggered fermions 
and Ref. [6] for a small review of such techniques. A technique for speeding up the calculation of 
meson two-point functions by averaging the low-lying eigenmodes over all spatial positions of the 
quarks was introduced in Ref. [7]; a similar method was also introduced in Ref. [8]. In addition, 
low-mode preconditioning in the so-called e-regime was studied in Ref. [9]. 

I will now describe some of the recent approaches used to implement deflation in the context 
of lattice QCD fermion inverters, concentrating on the methods developed in the context of Morgan 
and Wilcox's general non-hermitian deflation [10, 1 1], Stathopoulos' and Orginos' hermitian defla- 
tion [12], and Liischer's domain decomposition deflation [13]. Refs. [14] and [15], discussing the 
MorganAVilcox and Stathopoulos/Orginos algorithms, respectively, are included in these Proceed- 
ings. Preprints and talks given at this conference using deflation or related techniques are Usted in 
Refs. [16, 17]. 

2. Morgan's and Wilcox's work on non-hermitian deflation 

In the system of equations. Ax = h, where A is the known n^n matrix, K17I0V methods like 
GMRES and CG develop a polynomial solution in powers of A times ro, where ro is the residual 
vector. One can show the exact polynomial should have a zero at each eigenvalue in the complex 
plane. After m iterations, the Krylov subspace, K, is given by 

K = 5;7a«{ro,Aro,A2ro, ...A'"- Vo}. (2.1) 

The polynomial developed in this space has value unity at the origin in the (possibly complex) 
eigenvalue space. Therefore, for low m values it is necessary for this polynomial to make a sharp 
turn to zero near the origin if the system has small eigenvalues; this is why it is difficult for restarted 
methods to develop the correct small eigenvalue spectrum. Once the small eigenvalue spectrum 
has been developed, it is time-consuming and wasteful to re-develop the same small eigenvalue 
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spectrum for multiple uses of the matrix, such as occurs when multiple right-hand sides are solved. 
It would be better to try to effectively remove or deflate such eigenvalues, in the sense of the above 
introduction, to produce a more efficient simulation. This is the basic numerical problem faced by 
users of fermion matrix inverters in lattice QCD at small quark masses. 

Deflation has been studied in the context of restarted GMRES [18, 19, 20, 21, 22, 23, 24, 25, 
26, 27, 28]. One of these approaches is related to Sorensen's implicitly restarted Arnoldi method for 
eigenvalues [29] and is called GMRES with implicit restarting [27]. A mathematically equivalent 
method, called GMRES with deflated restarting (GMRES-DR) [28], is also related to Wu and 
Simon's restarted Arnoldi eigenvalue method [30], and is the method qualitatively described in 
this section. The MorganAVilcox work on deflation began in Ref. [31] in the context of GMRES, 
and was followed later by Ref. [32], which discusses incorporating multiple shifts and multiple 
right hand sides for GMRES and a deflated version of BiCGStab. The detailed description of these 
algorithms is given in [33, 10, 11]. 

An important feature of GMRES-DR is that it computes eigenvalues and solves linear equa- 
tions simultaneously. Approximate eigenvectors coiTcsponding to the small eigenvalues are com- 
puted at the end of each cycle and are put at the beginning of the next subspace. Letting ro be the 
initial residual for the linear equations at the start of the new cycle and , . . . va be harmonic Ritz 
vectors [34, 35, 36, 37], the subspace of dimension m used for the new cycle of GMRES-DR(m,fc) 
is 

Span{yi,y2, ■ ■ .j^,ro,Aro,AVo,AVo, . . . ,A'""*^"Vo}. (2.2) 

This can be viewed as a Krylov subspace generated with starting vector ro augmented with ap- 
proximate eigenvectors. Remarkably, the whole subspace turns out to be a Krylov subspace itself 
(though not with tq as starting vector) [27]. Once the approximate eigenvectors are moderately 
accurate, their inclusion in the subspace for GMRES essentially deflates the corresponding eigen- 
values from the linear equations problem. 

The approximate eigenvectors in GMRES-DR span a small Krylov subspace. They are gener- 
ated in a compact form by 

AVk = Vk+,Hk, (2.3) 

where Vt is a.nby k orthonormal matrix, Vt+i is the same except for an extra column, and Hk is 
a full k+l by k matrix. The columns of Vk span the subspace of approximate eigenvectors. At 
the end of a cycle of GMRES-DR, the harmonic Ritz values ai^e computed. The matrix Vk is then 
formed so that it is orthonormal and has columns spanning the harmonic Ritz vectors corresponding 
to desired harmonic Ritz values. Note this compact form is similar to an Arnoldi recurrence, and 
it allows access to both the approximate eigenvectors and their products with A while requiring 
storage of only k-\-l vectors of length n. 

Using the eigenvector information generated by GMRES -DR(m,^), subsequent systems are 
solved by combining restarted GMRES with a projection between cycles over the previously de- 
termined eigenvectors. The second and subsequent right-hand sides are solved much quicker than 
without the deflation. We call this method GMRES(m)-Proj(/:), where "k" is the number of deflated 
eigenvectors. 

An example of the use of GMRES-DR followed by GMRES-Proj is given in Fig. 1. This 
system consists of a 20^ x 32 quenched Wilson gauge configuration, evaluated essentially at Kcr- 
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Figure 1: The residual norm as a function of matrix vector products for various solvers on a quenched 
20^ X 32 Wilson configuration at fQ.^.. 




Figure 2: The approximate eigenspectrum associated with the matrix used in Fig. 1 calculated by GMRES- 
DR(70,50) with a residual norm stopping criterion of w 10^**. 



4 



Deflation Methods in Fermion Inverters 



Walter Wilcox 



0.1 



0.01 



0.001 




GMRES(20) 
GMRES-DR(30,10 
GMRES-DR(40,20) 
GMRES-DR(50,30) 
GMRES-DR(60,40) 
GMRES(20)-Proj(10) 
GMRES(20)-Proj(20) 
GMRES 20)-Proi 30 
GMRES 20 -Pro 40 



S 1e-04 



1e-05 



1e-06 



1e-07 



1e-08 



500 



1000 1500 2000 

Matrix-Vector Products 



2500 



3000 



Figure 3: The residual norm as a function of matrix vector products for various solvers on a quenched 
20^ X 32 twisted mass configuration. 



(In this and the following, we actually solve the even/odd preconditioned system.) In order to make 
the problem as realistic but difficult as possible, we have chosen a K" in each problem which makes 
the real part of the lowest Ritz eigenvalues essentially zero. The value used for this configuration, 
K = 0.15720, is very close to K^r = 0. 15691 [38]. Fig. 1 shows that GMRES(20) does not converge 
after 3000 iterations. It is incapable of forming the low eigenvector polynomial because of the small 
restart value. On the other hand, the various GMRES-DR residual norms show clearly that the 
correct small eigenvalues in the spectrum, seen in Fig. 2, are understood and dealt with efficiently. 
The evidence for this is the development of the sometimes sharp "knee" in the rate of convergence 
of the residual vector as a function of matrix vector products (MVPs). This phenomenon is called 
"super-linear convergence" and is seen near 1000 MVPs in this example. The various larger {m,k) 
pairs have a knee at approximately the same number of MVPs, but make an increasingly faster 
convergence after this point. The eigenvalues of each GMRES-DR(m,^) run are then passed on to 
GMRES(m — ^)-Proj(^). For exact comparison, we actually use the same right hand side in this and 
the other examples in this paper. Notice that the residual vector slopes of the GMRES-DR(m, k) and 
the corresponding GMRES(m — k)-Vm]{k) runs, after super-linear convergence, are approximately 
parallel, and that this slope is applied essentially from the beginning of the GMRES-Proj run, 
leading to a significantly faster congergence of the system. Also notice that there is only a small 
change in the residual curves for both GMRES-DR and GMRES-Proj when going from 40 to 50 
deflated eigenvalues. We say the residual "saturates" at this point. 
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The approximate low eigenvalue spectrum of this system, calculated with the initial run of 
GMRES-DR(70,50) is shown in Fig. 2. One can see the approximate positive/negative imaginary 
value symmetry in the spectrum as well as the fact that a single real eigenvalue appears very close 
to the origin, making the problem tough for any method. The stopping criterion on the residual 
norm on this run was 10^*^, which would be considered full convergence on the linear equation 
system. However, for increased convergence of GMRES-Proj on multiple right-hand sides it is 
often useful to run the GMRES-DR well past the point where the linear equations have converged. 

Results for GMRES-DR and GMRES-Proj on a 20^ x 32 twisted mass configuration, at = 
0.15728, /I = 0.005 are shown in Fig. 3. (This is the lowest maximally twisted quark mass con- 
sidered in Ref. [39].) Again, GMRES(20) fails to converge. The sharp knee in Fig. 1 signaling 
deflation is replaced by a more gradual slope change. In this case there is only a small change 
in the convergence properties in going from 30 to 40 deflated eigenvectors, after which there is 
saturation. This problem takes more iterations to converge than the Wilson one, and the gain from 
deflation (GMRES-Proj relative to GMRES-DR) is not as dramatic, but it still helps considerably. 
We will see later that the dynamical twisted mass case is also a tougher problem than the Wil- 
son/clover one in terms of the number of iterations rquired. 

We also consider matrix shifting for the GMRES-DR, GMRES-Proj algorithms, also called 
"multi-mass" because the shifts represents different quark masses in the Wilson formulation. Krylov 
subspaces are shift invariant in that A — Oil, where / is the identity matrix and (7/ are numbers, gen- 
erates the same Kiylov subspace as A no matter what the shift. Given this, the goal is to solve all 
shifted systems with this single Kiylov subspace. For non-restarted methods this has been done, 
for example, for QMR by Freund [40] and BiCGStab by Frommer [41]. 

Restarting makes the situation more difficult because the shifted residuals ai^e not parallel 
to one another, generating different Krylov subspaces. Frommer and Glassner, in the context of 
restarted GMRES [42] have provided a solution: force all residuals to all be parallel after a restart. 
One can then continue using a single Krylov subspace for all shifted systems. However, the minimal 
residual property is maintained only for the base shift system. In spite of this, we have not found this 
to be a problem. The residuals for the shifted systems for QCD problems are often indistinguishable 
from a non-shifted run at the same mass. 

The subspaces generated by GMRES-DR are a combination of approximate eigenvectors por- 
tion and Krylov subspace portion (see (2.2) above), but as pointed out above, they are Krylov 
themselves. Therefore, GMRES-DR can be restarted like GMRES for multiple shifts. 

Deflating eigenvalues is difficult for multiple shifts because one can not keep the residual vec- 
tors parallel unless one has exact eigenvectors. Lack of exact eigenvectors will cause the additional 
right-hand sides to "stall out". The solution is to force the error to be in the direction of one vector, 
namely v^+i from Eq.(2.3). One can then can correct the error at the end. One needs the solution 
of one extra right-hand side to accomplish this, but this cost will be low for many right-hand sides. 
The shifted version of GMRES-DR(m,A:) is called GMRES -DRS(m,)t) and the shifted version of 
GMRES(m)-Proj(A:) is called GMRES(m)-Proj(A:)-Sh. 

To show details on the method, consider Fig. 4. This uses a non-hermitain, bidiagonal matrix 
of order 2000 for illustration purposes, and is solved in MATLAB. The eigenvalue spectrum is 
0.1, 1,2,3, ... , 1999. The left part of the figure shows the solution of the first right-hand side, using 
GMRES-DRS(25,10). The solid blue line is the base system and the red dotted line is the shifted 
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Matrix- Vector Piodticta 



Figure 4: Illustration of shifting while solving 10 additional right-hand sides for a MATLAB matrix. The 
non-hermitian matrix used was of order 2000. 
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Figure 5: Illustration of shifting while projecting using GMRES-Proj-Sh on a new right-hand side for a 16'* 
Wilson matrix. 
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Figure 6: The residual norm as a function of matrix vector products on a 20^ x 32 Wilson lattice. Regular 
BiCGStab is compared to D-BiCGStab(A;) for A: =10, 20 or 30 low eigenvectors on the same lattice and 
right-hand side. 



one. The effect of deflation is seen again with the sharp knee in the residual norm. The next pair of 
falling lines represent the solution of the extra right-hand side, which is solved to lower accuracy. 
Next the second right-hand side is solved with GMRES(15)-Proj(10)-Sh. The green dotted line 
shows the uncorrected residual norm for the shifted system, which stalls out at k, 4.0 x 10~^, while 
the green dash-dot line shows the second shift residuals as if they were corrected. Actually, the 
correction needs to be done only once at the end of each right-hand side ! The new right-hand sides 
were only converged to 10^^, but that can easily be changed by running the base/shifted system 
longer. Notice the convergence is faster for the additional right-hand sides than for GMRES-DRS, 
because the eigenvectors are used from the beginning to speed up the convergence. Also the cost 
of GMRES(15)-Proj(10)-Sh is less than for GMRES-DRS(25,10), because it is fairly inexpensive 
to project over the approximate eigenvectors compared to keeping the eigenvectors in the GMRES- 
DR subspace. 

Doing multiple shifts while deflating using GMRES-DRS and GMRES-Proj-Sh is illustrated 
in Fig. 5 for a 16"* Wilson matrix at K" = 0.158. This is approximately Kcr- Here we just show 
the solution of the second right-hand side for the base system and two shifts, leaving out the so- 
lution of the extra right-hand side. The first right-hand side is solved with GMRES-DR(50,30) to 
residual tolerance of 10^^ and the extra right-hand side to 10^'. For the second right-hand side, 
GMRES-Proj uses 30 approximate eigenvectors for the projection between cycles of GMRES(20). 
GMRES(20)-Proj(30)-Sh converges in about one-tenth of the iterations needed for GMRES(20). 
To reach a residual norm of less than 10^^ for the base (unshifted) system takes 2680 matrix- vector 
products for GMRES(20)-Sh and 280 for GMRES(20)-Proj(30)-Sh. 
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Figure 8: The residual norm as a function of matrix vector products on a 24^ x 48 dynamical twisted mass 
configuration from the ETMC [44], evaluated at K" = 0.160859 and = 0.004, the smallest quark mass used 
in this work. 
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The GMRES methods are all restarted, and it is necessary to keep an associated Krylov set of 
vectors in memory. Non-restarted methods for large matrices often converge faster than restarted 
GMRES. For those who prefer not to restart but still want to solve multiple right-hand sides we 
have developed a deflated BiCGStab ("D-BiCGStab(^)") which deflates k eigenvectors. This has 
been described previously in Refs. [32] and [33]. The problem is that a simple projection over 
eigenvectors is not good enough to last for the entire run of BiCGStab. Our solution is to use a 
projection over both right and left approximate eigenvectors. We will see that this is very effec- 
tive, and in addition, the left eigenvectors may be obtained from the right eigenvectors for Wilson 
fermions using the (M^^)^ = yjM^^/s identity. 

Fig. 6 shows the effects of keeping different numbers of deflated eigenvectors on the residual 
norms as a function of MVPs during runs of D-BiCGStab(/:) on a 20^ x 32 Wilson lattice. Deflating 
about 50 eigenvectors on the 20^ x 32 seems to be optimal, after which residual norm saturation 
sets in. The typical speedup on a lattice of size is about 2.7 compared to regular BiCGStab, 
while the 20^ x 32 lattice speedup increases to about 5. Occasionally, an additional "breakthrough" 
speedup occurs; see examples in our PoS proceedings paper [14]. Again, these runs are done very 

close to Kcr- 

Figs. 7 and 8 compare the effect of GMRES-Proj and D-BiCGStab on dynamical configura- 
tions obtained from CP-PACS [43] and the ETMC [44], respectively. Notice the slow curvature of 
the residual curve for GMRES -DR(80,60) in Fig. 7, as opposed to the sharp knee of deflation in 
Fig. 1 for quenched Wilson fermions. This implies an almost continuous deflation of eigenvalues 
during the run. The final slope of the residual norm is significantly larger than the initial and this 
slope is applied essentially from the beginning of the GMRES(20)-Proj(60) or the D-BiCGStab(60) 
runs leading to a significant speedup. As we usually find, the non-restarted D-BiCGStab beats the 
GMRES-Proj at the end. Fig. 8 shows an even tougher problem, the lowest twisted mass from the 
ETM Collaboration. Regular BiCGStab does not converge. Two GMRES-DR curves are shown, 
for 30 and 50 eigenvectors, as well as corresponding D-BiCGStab runs. The superhnear effect 
of deflation, compared to dynamical Wilson/clover, is not as noticeable, just as for the quenched 
case. Again, D-BiCGStab slightly beats GMRES-Proj to full convergence. Note that for the D- 
BiCGStab run we had to prepare both left and right eigenvectors, which are separate calculations 
for the twisted case, as opposed to Wilson. However, the "down" eigenvectors are the "up" left 
eigenvectors, aside of a 75 factor, and vice versa. This means there is no extra work involved for 
left eigenmode preparation if both up and down quark propagators are always calculated. 

The optimal number of eigenvalues grows slowly with the size of the problem. On quenched 
lattice problems we have found that it increases from about 10 on small lattices (8^) to about 40 to 
50 on the largest lattices we have examined (20^ x 32), a scale change of over 60. Another very 
important question concerns the iteration count of the algorithm. If the number of iterations grows 
as the space-time volume of the lattice, V , then the number of computations is proportional to V^, 
given that the length of MVPs, which dominate the expense, is also proportional to V . We will 
call this the algorithm acceleration problem ^ Accordingly, after eigenvalues are deflated, how 
does the number of iterations vary with the size of the problem? We find that the average number 

'Luscher states a very specific way algoritiim growtii occurs during a QCD deflation process and calls it the "K^ 
problem". In contrast, I do not take a stand on how the algorithm growth comes about, but simply examine the growth 
in the number of the algorithm's algebraic steps during "peak" speedup. 
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Figure 9: The residual norm of the lowest extracted eigenvector of the Stathopoulos/Orginos algorithm 
compared to unrestarted Lanczos for a small Wilson lattice. 



of iterations in going from the 16"^ to the 20^ x 32 lattice (a volume change of 3.9) increased 
from about 300 to about 600 for GMRES-Proj and from about 300 to about 500 iterations for D- 
BiCGStab. Using the better results from D-BiCGStab and scaling the change in volume to 3 in 
order to compare with the other two algorithms, this represents an approximate 45% increase in 
the number of iterations. For comparison, the corresponding BiCGStab increase for this change is 
about 100%. 



3. Stathopoulos' and Orginos' work on CG deflation 

There are a number of problems in lattice QCD for which a non-hermitian approach is not 
necessary and possibly inefficient. The shifted inner loop of overlap fermions is one, the 75- 
hermitian Wilson-Dirac operator is another, and, if one is willing to use the normal equations, 
the Wilson/clover matrix is another. The optimal method is Conjugate Gradient (CG) for hermitian 
systems. Stathopoulos and Orginos have recently published a deflation algorithm using CG [12]. 

Their algorithm comes in three parts. The first part, eig-CG(nev, m), like GMRES-DR, solves 
lineal- equations and does simultaneous improvements of the deflated eigenvectors. Although the 
eigenvector calculation part of their algorithm is restarted, it does not affect the solution of the CG 
linear equations, which are not restarted, only inteiTupted. The second part, incremental eig-CG(5), 
where s labels the right-hand sides, can be thought of as a controller for eig-CG. It calls eig-CG, 
adds nev new eigenvectors to a separate, cumulative eigenvector subspace after each right-hand 
side, and does a Rayleigh-Ritz orthogonalization of the full set. It is used for the first s <s\ right- 
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Figure 10: The convergence history of the Statopoulos and Orginos lattice fermion solver for the 16^ x 64 
lattices (upper figure) and 2A^ x 64 lattices (lower figure) near m^r- 



hand sides. The third part, init-CG, is simply standard CG with an initial (and one additional) 
Galerkin projection using the completed set of eigenvectors for the final s> s\ right-hand sides. 

In some more detail, eig-CG(?iev, m) has a restarted subspace of maximum dimension m, which 
is made up of nev previous eigenvectors, nev current eigenvectors and (m — 2) * nev Krylov vectors. 
It uses Rayleigh-Ritz to compute the low eigenvectors and appends a portion of the CG search space 
to the eigenvectors. Typically, however, the linear equations converge faster than the eigenvalue 
part. Thus, something additional needs to be done to assure convergence, since eigenvalue accuracy 
is crucial to deflation. Incremental eig-CG(5) {s = 2,. ..) keeps track of the — 1) *nev previous 
eigenvectors, calls eig-CG, and accumulates another nev approximate Ritz vectors from each new 
right hand side, s, for s < s\. In the configuration run by Stathopoulos and Orginos, it needs 
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Figure 11: The quark mass dependence of the matrix vector operations of the Statopoulos and Orginos 
lattice fermion solver for the last 24 right-hand sides compared with the undeflated CG solve for the 16^ x 64 
lattices (upper figure) and 24^ x 64 lattices (lower figure). 



significant storage. 

Stathopoulos and Orginos tested tliis algoritlim on several anisotropic, 2 flavor Wilson fermion 
gauge fields, using tlie normal equations to solve for the hermitian combination M^M, where M is 
the Wilson-Dirac operator. The number of Wilson/Dirac MVPs (which are even/odd precondi- 
tioned) is therefore twice the number of the algorithm's iterations. There is an anisotropy factor 
in the time direction of 3 for their lattices. As tested, their algorithm uses single precision, ex- 
cept on dot products where double precision is used. They chose the maximum subspace size 
to be m = 100. In addition, nev =10, which means that a restart cycle (after the first) contains 
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100 — 2 * 10 = 80 iterations. They also tested for 48 total right-hand sides with s\ = 24. This was 
imlemented within Chroma, the lattice QCD C++ based software developed at Jefferson National 
Lab [45]. 

An interesting point that emerges in their numerical solves is that in spite of the fact that they 
restart the eigenvector part of their algorithm, it seems to converge as fast as unrestarted Lanczos. 
This is illustrated in Fig. 9. They ran various configurations of eig-CG(?iev, m) on a smaller Wilson 
matrix. When run with a small number of eigenvectors, the accuracy stagnates. However, keeping 
only 8 eigenvectors is sufficient to track unrestarted Lanczos. 

Fig. 10 gives the convergence history of the linear equations (not the eigenvectors) for the 
16^ X 64 and 24^ x 64 lattices, both near nicr, as a function of iteration count. Blue lines represent 
the first 24 right-hand sides using incremental eig-CG (calling eig-CG), the red lines the last 24 us- 
ing init-CG. Note the sub-linear convergence which occurs near single precision machine accuracy, 
10^^, which develops as more and more right hand sides are introduced. However, also notice the 
concomitant incremental decrease in the number of iterations necessary to solve each system. For 
the red lines, representing the last 24 init-CG runs, also note the quickly disappearing spike at the 
single restart, which is done as the systems again approach machine accuracy. 

The blue stars in Fig. 1 1 give the quark mass dependence of the iteration count needed for 
convergence on the last 24 right-hand sides. This figure also shows the same runs done with non- 
deflated CG with red circles. By comparing the approximate values given in this graph, one can 
deduce the approximate speedup compared to the original CG. The peak speedup is k, 10 on smaller 
lattice near 111^-, whereas the approximate integrated speedup, using Fig. 10, is f» 4 (all right-hand 
sides). For the larger lattices, the peak speedup is 7 on larger lattice near nic,-, and the integrated 
value is near 3. 

Another important aspect concerns the increase in the number of iterations of the algorithm 
for a given increase in the lattice volume. The two lattices examined, 16^ x 64 and 24^ x 64, have 
an approximate volume change of about 3.4. Fig. 11 shows that the iteration number on the last 
24 right-hand sides changes from about 250 on the smaller lattices to 800 on the larger near rticr- 
Scaling the iteration number increase to a volume increase factor of 3 instead of 3.4, this amounts 
to an approximate 190% increase in iterations, which is substantial. (The coiTcsponding increase 
in the number of iterations of the original CG is only 140%.) The algorithm would probably 
do better on the larger lattice if a larger number of eigenvectors were deflated, but this would 
increase the orthogonalization cost and possibly make the problem so large that it would no longer 
fit in computer memory. Also, double precision would help defeat the sub-Unear convergence. 
Nevertheless, the fiat behavior in Fig. 1 1 across quark masses and the peak speedup on these lattice 
is impressive. 



4. Luscher's work on domain decomposition deflation using GCR 

Luscher's domain decomposition algorithm introduces an interesting new type of deflation 
for lattice type systems. This algorithm breaks the lattice problem up into an "inner" deflation 
subspace, S (defined on blocks, of dimension N, defined later) and the "outer" orthogonal com- 



14 



Deflation Methods in Fermion Inverters 



Walter Wilcox 



plement, S . The basic deflated system is described as 

N 

Y{x) =x{x)+Z<^k{A-')ki{^,,ri), (4. 1) 

k.l 

where 

PLDx{x)=PL7lix),{l -Pr)x{x) = 0. (4.2) 

"D" is the Wilson-Dirac/clover operator, the (/)/ are basis fields on the blocks, and Aj^i is the matrix 
representing the Wilson-Dirac/clover operator on those blocks: 

A,, = (0,,D0,). (4.3) 

The definition of the projection operators Pl and Pr is 

N 

PMx) = Yix)-Y,DUA-')ki{(l>i,V), (4.4) 

k.l 

N 

Pr\I/{x) = va(^) (A- 1)^/(0,, Di/A). (4.5) 

k.l 

The outer part of Liischer's algorithm uses the Schwartz Alternating Procedure (SAP) matrix [46, 
47] as a right preconditioner on 4^ x 8 blocks and the Generahzed Conjugate Residual (GCR) [48] 
algorithm for the Krylov inverter on the 5^ space. After SAP preconditioning, the basic solve 
involves 

PLDMsAP^{x)=PL7lix), (4.6) 

X{x)=PrMsap(^{x). (4.7) 

Please see Refs. [49, 50] for more mathematical background on preconditioning and deflation on 
domains. 

The algorithm actually starts with subspace generation for later use in deflation. It applies 
inverse iteration (in the form of SAP) to A^^ (chosen to be 20) random global vectors to get linear 
combinations of low eigenmodes, y//, which are then projected onto the domains and orthonormal- 
ized. These are the block fields, 0/. The subspace is prepared a single time near m^r and does not 
need to be repeated. 

The algorithm solves the Wilson-Dirac/clover matrix on the full system knocking out the de- 
flated eigenvectors with the Pl projector while using the GCR algorithm. It considers one quark 
mass at a time. GCR allows for inexact preconditioning and works well with SAP. Preconditioning 
is important because it reduces the iteration count of GCR and therefore the deflation overhead. 
The dimensionality, A'^ = NgNb, where Nb is the number of blocks, of the inner part of the al- 
gorithm is fairly large. It is of size 20 X 2592 = 51,840 for the smaller lattice (24^ x 48) and 
20 X 8192 = 163, 840 for the larger (32^ x 48). It can not be solved exactly for a given source vec- 
tor, as one does on a typical Rayleigh-Ritz deflation subspace, but must be solved iteratively. It is 
even-odd preconditioned and solved with GCR. This inner space also is deflated, using the global 
vectors, used to prepare the deflation subspace. 
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Figure 12: Figure pertaining to the idea of "local coherence" within the context of domain decomposition. 
Low eigenmodes for free quarks on a periodic lattice are projected out using constant fields on domains. 
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Figure 13: Comparison of the solver times for even/odd preditioned BiCGStab ("EO+BiCGStab"), domain 
decomposition ("SAP+GCR"), and deflated domain decomposition ("DFL+SAP+GCR") on a 32^ x 64 lat- 
tice. 
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Deflation on domains works because of "local coherence". Only a small number of global 
vectors projected onto the blocks are needed to project out low modes. This is analogous to filter- 
ing out low modes for free quarks on a periodic lattice. Liischer illustrates this in Fig. 12, where 
constant fields on the domains are used to approximately project the low, smooth fields. It illus- 
trates that high deflation efficiencies can be achieved by subspaces of fields that are only piecewise 
smooth and are not approximate eigenmodes of the Dirac operator themselves. 

Liischer has done his development work within the context of dynamical 2 flavor Wilson/clover 
fermions (50 configurations). Fig. 13 gives the final results of the tests, outside of the overhead for 
subspace generation. Comparison is made between even/odd preconditioned BiCGStab, domain 
decomposition, and domain decomposition deflation on the larger of the two lattices. This figure 
gives the average execution time for the solution of the system as a function of inverse quark 
mass in lattice units. The dependences are approximately linear for all three algorithms. The total 
overhead for subspace generation was 150s for the 24^ x 48 lattice and 184s for 32^ x 64 lattice 
(tuned to nicr)- These overheads can not be compared to each other directly because they are done 
on different computer arrays. Using the Table 1 numbers from Ref. [13], the peak speedup of the 
algorithm at the lightest quark mass, relative to BiCGStab on the same system, is 366/32 = 1 1.4. 
The integrated speedup relative to BiCGStab may be computed in various ways depending on 
usage. For just the 5 mass solves used in Table 1 on the large lattice, one obtains 966/314 = 3.1 
when the overhead of 184s is included. A more typical usage would be to compute at least 1 quark 
propagator (= 12 solves) at each mass. Considering all preparatory work, including the initial inner 
Dirac solve, the integrated speedup factor is then (12*966)/(12*130-i-184-i-4*3.9)=6.6. 

The volume change in going from the small (24^ x 48) to large lattices (32^ x 48) is about 3.2. 
Scaling this down to a volume change of 3 again, Luscher's Table 1 shows there is an approximate 
13% increase in the number of outer GCR iterations for this change. Luscher's Table 2 also shows 
there is an accompanying increase of approximately 45% in the inner GCR iterations for the same 
volume change. Although this constitutes only a fraction of the time spent for the solution of the 
full system (20 — 30%), this fraction will increase for larger lattices. For comparison to the deflated 
algorithm's 13% increase in the number of outer iterations, note that the increase in the number of 
BiCGStab iterations for the same scale change is about 30%. 

The iterative growth of the MorganAVilcox and Liischer algorithms, relative to BiCGStab, 
seem to be very close. For the scale change of three, the iteration change ratio for MorganAVilcox 
is 45%/100% K. 0.45, and for Luscher it is 13%/30% « 0.43. Of course these come from very 
different sized lattices and the MorganAVilcox lattices are quenched. 

5. Comparison table 

For purposes of quick comparison of the three algorithms outlined above. Table 1 may be use- 
ful. The columns list the three algorithms and the rows give some of their major aspects. Some 
explanation is necessary. The "Matrix type" entry is pointing out that the MorganAVilcox and 
Stathopoulos/Orginos algorithms are general algebraic solvers, whereas Liischer's domain decom- 
position method works on a lattice-defined system. Although for the preconditioning question I an- 
swer in the negative for the MorganAVilcox and Stathopoulos/Orginos cases, both use an even/odd 
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MorganAVilcox 


Stath./Orginos 


Liischer 


Basic solvers 


GMRES; BiCGStab 


CG 


GCR 


Matrix type 


non-hermitian 


hermitian 


non-hermitian 




(algebraic) 


(algebraic) 


(lattice) 


Preconditioning? 


no 


no 


yes (SAP) 


Simultaneous solve? 


yes 


yes 


no 


Eigenvalue use 


Every cycle (GMRES-Proj) 


Every cycle (^ < ^i); 


Every outer 


for multiple rhs's 


At beginning (BiCGStab) 


beginning 
and restart {s> si) 


iteration 


Matrix shifting? 


yes (GMRES) 
no (BiCGStab) 


no 


no 


Algorithm 


mild 


large 


small 


acceleration 









Table 1: Some points of comparison for the three algorithms considered. 



solve for the problem addressed. (Liischer also uses such a preconditioning for his "inner" Dirac 
solve.) The "Simultaneous solve?" row concerns the question of whether the deflation steps are 
taking place simultaneously with the solution of the Unear equations. For the domain decomposi- 
tion algorithm this is a separate step, which, however, does not need to be repeated for the different 
quark masses. The "Eigenvalue use for multiple rhs's" entry is giving the frequency with which 
eigenvectors or linear combinations of eigenvectors are projected within the algorithm. For the 
Stathopoulos/Orginos algorithm this is done cyclically (cycle chosen is 80 iterations) for the first 
si right-hand sides, but are only used at the beginning (and one restart) for the remainder. For 
Liischer' s domain decomposition actually both the inner and outer GCR loops are deflated. The 
"Algorithm acceleration" row concerns how the iteration count increases for an increase in the size 
of the problem. We have chosen to list the MorganAVilcox increase as "mild", but a true com- 
parison with the matrices used by Liischer should be done for the same size matrices, where we 
would expect a continued increase in the effectiveness of our algorithms while maintaining slow 
eigenvalue number growth. 
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